Merge conflict resolution with main (#36)
Conversation
Large model downloads via huggingface_hub often hang or fail around 10GB. This adds a pre-download step with configurable retry/timeout before load_model() is called, so interrupted downloads can be resumed.
- New CLI flags for `serve`: `--download-timeout`, `--download-retries`, `--offline`
- New subcommand: `vllm-mlx download <model>` for pre-warming caches
Closes #75
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
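A minimal sketch of the retry loop around huggingface_hub's snapshot_download; the helper name and exact flag plumbing are illustrative, not the PR's actual code:

```python
import time
from huggingface_hub import snapshot_download

def predownload(model: str, retries: int = 3, timeout: float = 60.0, offline: bool = False) -> str:
    """Resolve the model into the local HF cache before load_model() runs.

    snapshot_download skips files that are already complete, so each retry
    effectively resumes an interrupted download rather than starting over.
    """
    last_err: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            return snapshot_download(
                repo_id=model,
                local_files_only=offline,   # --offline: never touch the network
                etag_timeout=timeout,       # --download-timeout (per-request timeout)
            )
        except Exception as err:            # real code would narrow the exception types
            last_err = err
            time.sleep(min(2 ** attempt, 30))   # simple backoff between attempts
    raise RuntimeError(f"download failed after {retries} attempts") from last_err
```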
The output_token_ids from AsyncEngineCore were tracked internally but never forwarded to GenerationOutput, leaving tokens always []. Also adds tests for the generate() output fields. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse MiniMax-M2.5's XML tool call format:
```
<minimax:tool_call>
<invoke name="function">
<parameter name="arg">value</parameter>
</invoke>
</minimax:tool_call>
```
Handles single/multiple tool calls, JSON parameter values, no-parameter calls, and preserves <think> blocks.
9 unit tests included.
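A minimal sketch of how this format can be parsed with regexes (names and structure are illustrative, not the parser's actual implementation):

```python
import json
import re

_INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
_PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_minimax_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from <minimax:tool_call> blocks; <think> text outside them is untouched."""
    calls = []
    for block in re.findall(r"<minimax:tool_call>(.*?)</minimax:tool_call>", text, re.DOTALL):
        for name, body in _INVOKE_RE.findall(block):
            args = {}                            # stays empty for no-parameter calls
            for key, raw in _PARAM_RE.findall(body):
                raw = raw.strip()
                try:
                    args[key] = json.loads(raw)  # JSON parameter values (numbers, objects, lists)
                except json.JSONDecodeError:
                    args[key] = raw              # plain string value
            calls.append({"name": name, "arguments": args})
    return calls
```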
…n streaming parser
The streaming reasoning parser (BaseThinkingReasoningParser) scans the full
accumulated output text for `<think>`/`</think>` on every token via `in` checks on
previous_text and current_text. This is O(N) per token and O(N²) over a full
generation, becoming measurable at longer outputs (5ms+ at 2k tokens, 141ms
at 10k tokens).
Replace with a three-phase state machine (pre_think → thinking → content) that
tracks transitions using only the delta text. Each token is now O(1) regardless
of output length.
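A minimal sketch of the three-phase idea (simplified: it ignores tags split across token boundaries and the implicit-thinking mode, both of which the real parser handles):

```python
class ThinkStreamSketch:
    """Classify each streamed delta as reasoning or content using only the delta text."""

    def __init__(self, start_tag: str = "<think>", end_tag: str = "</think>"):
        self.start_tag, self.end_tag = start_tag, end_tag
        self.phase = "pre_think"          # pre_think -> thinking -> content

    def feed(self, delta: str) -> tuple[str, str]:
        """Return (reasoning_delta, content_delta) for one token's text."""
        if self.phase == "pre_think":
            if self.start_tag in delta:
                self.phase = "thinking"
                delta = delta.split(self.start_tag, 1)[1]
            else:
                self.phase = "content"    # no opening tag seen: treat as content
                return "", delta
        if self.phase == "thinking":
            if self.end_tag in delta:
                reasoning, content = delta.split(self.end_tag, 1)
                self.phase = "content"
                return reasoning, content
            return delta, ""
        return "", delta                   # content phase: pass everything through
```

Cost per token depends only on the delta length, not on how much text has already been generated.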
Benchmark results (streaming parser overhead, simulated server loop):

| Tokens | Old (scan) | New (state) | Speedup |
|-------:|-----------:|------------:|--------:|
| 500 | 0.37ms | 0.04ms | 8.6x |
| 1000 | 1.38ms | 0.10ms | 13.5x |
| 2000 | 5.28ms | 0.28ms | 19.1x |
| 5000 | 34.03ms | 2.05ms | 16.6x |
| 10000 | 141.26ms | 10.16ms | 13.9x |
At 50 tok/s decode on Apple Silicon, each token has a 20ms budget. The old parser
consumed 0.3ms/tok at 2k tokens and 1.4ms/tok at 10k — up to 7% of the budget
on overhead alone. The new parser is <0.01ms/tok at any length.
Changes:
- think_parser.py: Rewrote extract_reasoning_streaming() as a state machine with
_phase tracking. reset_state() initializes the phase. All three scenarios
preserved (explicit tags, implicit mode, no tags). Method signature unchanged
for backward compatibility.
- benchmarks/bench_reasoning_parser.py: Added streaming parser benchmark.
No changes to extract_reasoning() (non-streaming path) — it only runs once per
request and is not on the hot path.
Add _normalize_messages() to server.py and call it in all request paths before apply_chat_template. Maps non-standard roles (developer -> system, per OpenAI Responses API) and merges consecutive same-role messages.
Fixes agent crashes from:
- OpenAI Responses API sending role="developer" (unrecognized by Qwen3.5 template)
- OpenCode sending [system, system, user, user] (rejected by alternating-role templates)
Applied in create_chat_completion (both MLLM and LLM paths), create_anthropic_message, and _stream_anthropic_messages.
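A minimal sketch of the normalization (illustrative; the PR's actual helper lives in server.py):

```python
ROLE_MAP = {"developer": "system"}   # OpenAI Responses API role -> chat-template role

def normalize_messages(messages: list[dict]) -> list[dict]:
    """Map non-standard roles and merge consecutive same-role messages."""
    merged: list[dict] = []
    for msg in messages:
        role = ROLE_MAP.get(msg.get("role"), msg.get("role"))
        content = msg.get("content") or ""
        if merged and merged[-1]["role"] == role:
            # e.g. [system, system, user, user] collapses to one system and one user turn
            merged[-1]["content"] += "\n\n" + content
        else:
            merged.append({"role": role, "content": content})
    return merged
```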
Add detection and inference support for Google's Gemma 4 models (e.g. mlx-community/gemma-4-e2b-it-mxfp4) which include vision and audio capabilities via mlx-vlm >= 0.4.3. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Patch gemma4 Attention to snapshot cache.offset before mutation (mx.array.__iadd__ is in-place, causes wrong RoPE positions)
- Add Gemma 4 reasoning parser with channel name stripping (strips "thought"/"response" prefixes, supports both <channel|> and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
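A minimal sketch of the offset pitfall and the snapshot fix (illustrative, not the patch's exact code): because `cache.offset += n` on an mx.array mutates the array in place, any reference captured earlier moves with it, so the patched attention copies the value out first.

```python
import mlx.core as mx

# Buggy pattern:
#   offset = cache.offset            # aliases the cache's mx.array (shared storage)
#   cache.offset += n_new_tokens     # in-place __iadd__ also changes `offset`
#   positions = offset + mx.arange(n_new_tokens)   # wrong RoPE positions

def rope_positions(cache_offset: mx.array, n_new_tokens: int) -> mx.array:
    """Snapshot the offset as a plain int before the cache is advanced."""
    start = int(cache_offset)                # copy the scalar out of the shared array
    return start + mx.arange(n_new_tokens)   # immune to later in-place updates on the cache
```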
…okenizer
- Accept RotatingKVCache (used by Gemma 4) in batch cache validation
- Add missing return statement in load_model_with_fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This depends on PR 215 or PR 243 being applied first.
Error responses with token=0 were falling through to the detokenizer and decoding garbage text. Now they skip decoding and set the request status to FINISHED_ABORTED. Added a test for this case. Also ran black on batched.py to fix CI.
feat: add Gemma 4 multimodal model support
- Fix BatchKVCache offset bug: mx.array.__iadd__ mutates in-place, causing incorrect RoPE positions and token repetition
- Fix RotatingKVCache.max_size returning mx.array instead of int
- Add Gemma 4 reasoning parser (--reasoning-parser gemma4)
- Read additional EOS tokens from generation_config.json
- Fix RotatingKVCache prefix cache extraction (negative left_padding)
- Relax isinstance guard to accept RotatingKVCache for sliding window models like Gemma 4 (fixes ValueError on continuous batching)
- Remove unused _make_batch_cache() dead code
- Fix Anthropic endpoint JSON parsing for clients sending invalid escape sequences (e.g. \s, \d in regex patterns within tool defs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache
* test: add Gemma 4 tool parser tests (red)
* feat: add Gemma 4 tool call parser
* feat: register Gemma 4 parser, add streaming tests and wiring
* test: add edge case tests for DC review findings
  - Unclosed tool call block (server fallback path)
  - String containing colon (step-ordering guard)
  - String with real newline and double quote (JSON escaping)
* test: verify Gemma 4 tool calls produce exact OpenAI format for Claude Code
  Integration tests that verify the full pipeline (parser → server models → JSON serialization) matches what Claude Code expects: tool_calls structure, null content, function.arguments as JSON string, correct finish_reason.
* add Gemma 4 auto-detection to AutoToolParser
  Integrates Gemma 4 format as the first format tried in auto-detection, adds streaming markers for tool call start/end. Based on keegoid's approach in #254.
* remove unused pytest imports
* run black on tool parser, tests, and server
Co-authored-by: Jack Neil <jackneil@Jacks-Mac-Studio.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
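For reference, the OpenAI-format assistant message shape those integration tests check, as a generic example (not taken from the test fixtures):

```python
expected_message = {
    "role": "assistant",
    "content": None,                      # null content when only tool calls are returned
    "tool_calls": [
        {
            "id": "call_abc123",          # any unique id
            "type": "function",
            "function": {
                "name": "get_weather",
                # arguments must be a JSON *string*, not a dict
                "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
            },
        }
    ],
}
expected_finish_reason = "tool_calls"
```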
…258) Extends MLLM batch generator to support top_k, min_p, and presence_penalty alongside the existing repetition_penalty. This gives the MLLM path full parity with the LLM/SimpleEngine sampling parameter coverage.
Changes:
- MLLMBatchRequest: add top_k, min_p, presence_penalty fields
- MLLMBatch: add per-request samplers list (filter/extend support)
- _process_prompts: build per-request logits processors for presence_penalty and per-request samplers for top_k/min_p
- _step: accept and apply per-request samplers
- SamplingParams: add presence_penalty field
- MLLMScheduler: propagate new params from kwargs to batch requests
- BatchedEngine: pass new params through generate/stream_generate
When a request uses default values (top_k=0, min_p=0.0, presence_penalty=0.0), no extra processors or samplers are created — zero overhead for standard requests.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Fix Qwen3.5 hybrid paged cache reconstruction
* fix: add deduplication safety test and remove duplicate tokenizer hunk
  Add test confirming deduplicated terminal blocks correctly isolate recurrent state per sequence. Remove the duplicate tokenizer return fix that already ships in PR #215.
* style: format hybrid cache follow-up
* fix: keep simple engine serialized across cancellation (#8)
* fix: avoid nested simple engine generation locks
* fix: catch BaseException in cancellation handler, fix async test markers
  _run_blocking_serialized catches CancelledError (a BaseException subclass) from the outer scope, but the inner try/except used Exception, which would let a second CancelledError during `await task` escape unhandled. Changed to BaseException to suppress any exception from the draining await.
  Also fix test_simple_engine.py to use pytest.mark.anyio instead of pytest.mark.asyncio (pytest-asyncio is not configured), and add the anyio_backend fixture to conftest.py restricting to asyncio only, since trio is not installed.
* fix: preserve prompt token accounting after upstream refresh
* fix: restore specprefill fallback helper scope
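A minimal sketch of why the handler needs BaseException (on Python ≥ 3.8, asyncio.CancelledError subclasses BaseException, not Exception; the function name here is illustrative):

```python
import asyncio

async def drain_cancelled(task: asyncio.Task) -> None:
    """Await an already-cancelled generation task without letting a second cancel escape."""
    try:
        await task
    except BaseException:
        # `except Exception` is not enough: asyncio.CancelledError inherits from
        # BaseException, so a second cancellation delivered during this await would
        # otherwise propagate out of the cleanup path unhandled.
        pass
```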
…#221)
* Fix chunked prefill for mlx-lm prompt checkpoints
* fix: invoke prompt_checkpoint_callback in chunked-prefill path
  The upstream BatchGenerator contract requires prompt_checkpoint_callback to fire after cache finalization, before the checkpoint tail model call. The chunked-prefill monkeypatch preserved the checkpoint field but never invoked the callback, breaking the upstream checkpoint contract. Wire _lazy_extract_cache from mlx-lm and invoke the callback at the correct semantic boundary. Add regression test verifying the callback fires with the correct uid and checkpoint offset.
* test: cover checkpoint tail replay on upstream refresh
* style: format prompt checkpoint refresh
* fix: tolerate mlx-lm Batch export drift in chunked prefill
fix: populate tokens field in BatchedEngine.generate()
Upgrade mlx-vlm and torchvision so Qwen3.5 multimodal will run
* fix(server): integrate tool call parser into reasoning parser streaming path
* use _model_name instead of request.model in reasoning tool chunk
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
When tool_choice='none', models should never return tool calls. Two fixes:
1. Strip tools from chat template context — prevents templates from activating tool-call token generation.
2. Suppress tool call parsing — _parse_tool_calls_with_parser() returns early with no tools, streaming parser skips initialization.
Applied across all server paths: chat completions (streaming + non-streaming), Anthropic adapter (streaming + non-streaming).
Fixes #162
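A minimal sketch of the two suppression points (function and field names are illustrative, not the server's actual code):

```python
def build_template_context(request, tools):
    # 1. Keep tools out of the chat template so tool-call tokens are never primed.
    if request.tool_choice == "none":
        tools = None
    return {"tools": tools}

def maybe_parse_tool_calls(request, text, parser):
    # 2. Skip tool-call parsing entirely; the model output is returned as plain content.
    if request.tool_choice == "none" or not request.tools:
        return text, []
    return parser.extract_tool_calls(text)
```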
) Claude Code injects `x-anthropic-billing-header: cc_version=...; cch=HASH;` into the system prompt. The `cch=` hash changes with every request, causing token sequences to diverge at position ~40 and completely defeating prefix cache reuse across turn boundaries.
Strip this header before tokenization so consecutive requests from the same conversation share 99%+ of their token prefix.
Result: 50s → 3.65s per request (13.7x speedup) on Gemma 4 26B-A4B with 60K-token prompts.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
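A minimal sketch of the stripping step (illustrative regex; the real header contents vary per request):

```python
import re

# Matches e.g. "x-anthropic-billing-header: cc_version=1.2.3; cch=deadbeef;" on its own line.
_BILLING_HEADER_RE = re.compile(
    r"^x-anthropic-billing-header:.*(?:\n|$)", re.IGNORECASE | re.MULTILINE
)

def strip_billing_header(system_prompt: str) -> str:
    """Remove the per-request billing line so consecutive turns share the same token prefix."""
    return _BILLING_HEADER_RE.sub("", system_prompt)
```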
…fill
New files:
- patches/qwen3_5_mllm.py: BatchKVCache offset fix for Qwen3.5
- patches/qwen3_5_mtp.py: Runtime MTP injection for Qwen3.5
- tool_parsers/minimax_tool_parser.py: MiniMax-M2 tool parser
- scripts/add_mtp_weights_qwen35.py: Extract MTP weights from BF16
Key changes:
- mllm_batch_generator: chunked prefill, mid-batch extend, MTP hooks, patch registration, repetition penalty, prefill abort, think-suffix stripping for prefix cache
- mllm_scheduler: request status, cache config, prefill abort
- server: enable_thinking, tool_choice=none, tool argument coercion
- engines: MTP injection, enable_thinking, gpu_memory_utilization
- memory_cache: block LCP for hybrid models (SSM can't be rewound)
Prefix cache fix: enable_thinking=True adds <think>\n to the generation prompt, breaking PREFIX match across conversation turns. Strip these tokens from cache keys in both store and fetch paths so stored entries match as clean prefixes.
Tested: 3.12s → 0.39s (8x) for 1400-token prompts on Qwen3.5-122B hybrid model.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
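A minimal sketch of the prefix-cache key fix (illustrative; the helper name and the way the suffix tokens are obtained are assumptions):

```python
def cache_key_tokens(prompt_tokens: list[int], think_suffix: list[int]) -> list[int]:
    """Drop the trailing <think>\\n tokens that enable_thinking appends to the generation
    prompt, so stored entries and later lookups hash the same clean prefix."""
    n = len(think_suffix)
    if n and prompt_tokens[-n:] == think_suffix:
        return prompt_tokens[:-n]
    return prompt_tokens

# Applied symmetrically on the store path and the fetch path, e.g.:
# suffix = tokenizer.encode("<think>\n", add_special_tokens=False)
# key = hash(tuple(cache_key_tokens(tokens, suffix)))
```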
…-format Add <function=name> format support to Qwen tool parser
fix: import Path in tokenizer utils
…-leak fix: skip RNN snapshots in MTP optimistic mode to prevent memory leak
…e-machine perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens)
feat: add MiniMax tool call parsing support
…r-restack cli: expose harmony and gpt-oss tool parsers
* fix: unify tool-enabled simple chat on streaming path
* fix: preserve simple chat contracts on streaming path
* fix: keep tool chat on the streaming execution path
* fix: preserve streamed completion token counts
The try/except block computing `tokens` via tokenizer.encode() was unused -- the return statement already reads from final_output.tokens.
…stack simple-engine: keep tool chat on the streaming execution path
…, repetition_penalty) (#213)
Pass all OpenAI-compatible sampling parameters through to mlx-lm's make_sampler and make_logits_processors. Previously only temperature, top_p and max_tokens reached the engine — top_k, min_p, presence_penalty and repetition_penalty were silently dropped.
Changes:
- api/models.py: Add fields to ChatCompletionRequest and CompletionRequest
- request.py: Add presence_penalty to SamplingParams dataclass
- server.py: Extract and pass all params in every code path (6 locations), log all params on request
- models/llm.py: Build sampler with top_k/min_p, build logits_processors for presence_penalty/repetition_penalty
- engine/simple.py: Fix enable_thinking to read VLLM_MLX_ENABLE_THINKING env var instead of hardcoding based on model name
Tested with all 4 Unsloth Qwen 3.5 sampling profiles on 122B model.
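A minimal sketch of a presence-penalty logits processor of the kind built in models/llm.py (illustrative, not the PR's exact code; it assumes the usual `(generated_tokens, logits) -> logits` processor signature):

```python
import mlx.core as mx

def make_presence_penalty_processor(penalty: float):
    """Flat penalty on every vocabulary id that has already been generated.

    This follows the OpenAI definition of presence_penalty: unlike frequency_penalty,
    the penalty does not scale with how many times a token has repeated.
    """
    def processor(generated_tokens: mx.array, logits: mx.array) -> mx.array:
        if penalty == 0.0 or generated_tokens.size == 0:
            return logits
        counts = mx.zeros((logits.shape[-1],)).at[generated_tokens].add(1.0)
        seen = mx.minimum(counts, 1.0)      # 1.0 for any token that appeared at least once
        return logits - penalty * seen
    return processor
```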
* compatibility with mlx-lm 0.31.x BatchGenerator API
  The backport in f61d34e assumed internal BatchGenerator APIs that were refactored in mlx-lm 0.31.x. This breaks bench and serve for all users on v0.2.7.
  Changes:
  - Set prompt_progress_callback as instance attribute instead of passing it to BatchGenerator constructor (not a valid parameter)
  - Guard _install_chunked_prefill with hasattr check and log warning when skipped (relies on removed _process_prompts, active_batch)
  - Handle next() returning (prompt_responses, generation_responses) tuple instead of flat list
  - Add hasattr guard for active_batch in periodic cache eval
  Benchmark (Llama-3.2-1B-Instruct-4bit, mlx-lm 0.31.2):
  - Total time: 2.38s
  - Prompts: 10
  - Prompts/second: 4.19
  - Total prompt tokens: 80
  - Total completion tokens: 960
  - Total tokens: 1040
  - Tokens/second: 402.52
  - Throughput: 436.06 tok/s
  Closes #293
* bump to 0.2.8
* feat: add --prefill-step-size CLI flag
  Expose prefill_step_size as a CLI argument for both serve and bench commands. Default of 0 means "use engine default" (2048 for LLM, 1024 for MLLM), preserving existing behavior.
  Vision models routinely exceed 1024 tokens per prompt (images alone contribute 1400+), hitting the MLLM batch generator's safe limit. This flag lets users raise the limit without patching source code.
* Clarify MLLM prefill step override behavior
* refactor: clarify MLLM prefill CLI flag and validate override
…er stream_generate (#266)
stream_generate() is the only code path that consumes per-request SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and routes through _stream_generate_specprefill() when engaged. The prior direct self._model.generate() path silently dropped those overrides: server.py's create_completion() extracts them from extra_body and forwards to engine.generate(), engine.generate() forwards via **kwargs to _model.generate(), but _model.generate() (mlx_lm.generate) does not consume them. Non-streaming /v1/completions clients that sent `{"extra_body": {"specprefill": true}}` had their overrides silently no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates self.stream_generate() and returns the last GenerationOutput. Matches the pattern PR #222 established for tool-enabled chat().
Non-streaming clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the streaming path
- Same thread-safety (stream_generate holds self._generation_lock around the MLX call)
Scope: only generate() changes. chat() stays on its current path; extending chat() to the full accumulator pattern is a separate follow-up on top of PR #222.
Tests:
- New test_generate_accumulates_over_stream_generate stubs stream_generate with an async generator, calls generate() with per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens, completion_tokens, finish_reason, finished) match the last yielded chunk
  * specprefill / specprefill_keep_pct were forwarded through to stream_generate
- New test_generate_empty_stream_returns_safe_default covers the empty-stream edge case (returns GenerationOutput(text="", finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking so test_lock_prevents_concurrent_generate still observes serialization through the new accumulator path
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2 Ultra with a ~6K token prompt and extra_body.specprefill=true forcing SpecPrefill below the 8192 threshold:
  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR #265 (CompletionRequest schema + server-side extract_body -> gen_kwargs threading) which opens the wire from /v1/completions to engine.generate(). This PR closes the wire on the engine side.
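A minimal sketch of the accumulator pattern described above (field names follow the PR text; everything else is illustrative):

```python
async def generate(self, prompt: str, **kwargs) -> "GenerationOutput":
    """Thin non-streaming wrapper: consume stream_generate() and keep the last chunk.

    Every per-request kwarg (specprefill, specprefill_keep_pct, sampling params, ...)
    is forwarded untouched, so the two paths stay behaviorally identical.
    """
    final = None
    async for chunk in self.stream_generate(prompt, **kwargs):
        final = chunk                       # chunks are cumulative; the last one is the full output
    if final is None:                       # empty stream: safe default instead of raising
        return GenerationOutput(text="", finish_reason="stop")
    return final
```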
* feat(api): per-request SpecPrefill overrides on /v1/completions
  ChatCompletionRequest already accepts per-request `specprefill` and `specprefill_keep_pct` overrides, and /v1/chat/completions threads them into engine.chat(). CompletionRequest does not, so /v1/completions clients cannot opt a single request into (or out of) SpecPrefill, nor tune the keep percentage per request.
  Changes:
  - vllm_mlx/api/models.py: add specprefill and specprefill_keep_pct to CompletionRequest, matching the existing ChatCompletionRequest fields.
  - vllm_mlx/server.py::create_completion: extract both and thread into engine.generate(**gen_kwargs), mirroring the pattern used at server.py:1421 in create_chat_completion.
  - vllm_mlx/server.py::stream_completion: apply the same extraction so streaming /v1/completions clients get the same control.
  Both new fields default to None, so existing behavior is unchanged for clients that do not set them. No schema changes to ChatCompletionRequest. No engine-side changes needed: SimpleEngine.stream_generate already consumes these kwargs (see simple.py:307-308).
* style(server): align completions kwargs handling
Review Summary by Qodo
Prefix cache, MTP support, tool improvements, and streaming enhancements with comprehensive testing
Walkthrough
- **Prefix cache and KV cache optimization**: Added MemoryAwarePrefixCache support with chat template normalization, cache extraction/merging with RotatingKVCache buffer trimming, and hybrid model support (ArraysCache)
- **Multi-Token Prediction (MTP) support**: Implemented MTP injection for Qwen3.5 models via inject_mtp_support(), added MTP weight extraction script for Dense/MoE architectures, and integrated MTP into scheduler with always-advance verification
- **Chunked prefill with prompt checkpoints**: Enhanced chunked prefill to support prompt checkpoints (positive values for token positions, non-positive for offsets) with checkpoint tail replay and callback support
- **Client disconnect detection**: Added PrefillAbortedError exception and prefill abort tracking with heartbeat SSE comments in _disconnect_guard() to detect disconnects during long prefill operations
- **Tool call improvements**: Implemented tool argument coercion via _coerce_tool_arguments() to fix LLM tool failures, added Gemma 4 and Qwen function format parsers with streaming buffering support, and improved tool call filtering
- **Message normalization**: Added _normalize_messages() to map non-standard roles (e.g., "developer" → "system") and merge consecutive same-role messages for chat template compatibility
- **Reasoning parser enhancements**: Refactored streaming parser to state-machine approach with three phases (pre_think, thinking, content), added Gemma 4 reasoning parser support
- **Per-request sampling parameters**: Added forwarding of top_k, min_p, presence_penalty, repetition_penalty through all generation paths (completion, chat, streaming variants)
- **GPU memory utilization configuration**: Added --gpu-memory-utilization flag and dynamic memory pressure threshold calculation for Metal allocation limits
- **Model download utilities**: Implemented download_command() with retry logic, timeout, and offline mode support; optimized VLM loading with up-front detection to avoid double-loading penalty
- **Blocking operation refactoring**: Added _run_blocking_serialized() method for safe MLX operations under generation lock with proper cancellation handling
- **Comprehensive test coverage**: Added tests for chunked prefill checkpoints, Gemma 4 tool/reasoning parsers, streaming chat completion, message normalization, download utilities, and streaming aggregation

Diagram
```mermaid
flowchart LR
    A["Prefix Cache<br/>KV Optimization"] --> B["Batch Generator<br/>Chunked Prefill"]
    C["MTP Injection<br/>Qwen3.5"] --> B
    D["Tool Parsers<br/>Gemma4/Qwen"] --> E["Server<br/>Tool Coercion"]
    F["Message<br/>Normalization"] --> E
    G["Reasoning<br/>State Machine"] --> E
    B --> H["Scheduler<br/>Request Tracking"]
    H --> I["Engine<br/>Sampling Parameters"]
    J["GPU Memory<br/>Utilization"] --> I
    K["Download<br/>Utilities"] --> L["CLI<br/>Model Pre-download"]
    L --> I
    M["Blocking Ops<br/>Serialization"] --> I
```
File Changes
1. vllm_mlx/mllm_batch_generator.py

Code Review by Qodo
```python
# Emergency memory pressure threshold — dynamic based on gpu_memory_utilization
_gpu_mem_util = self.config.gpu_memory_utilization
try:
    _device_info = mx.device_info()
    _max_recommended = _device_info.get(
        "max_recommended_working_set_size",
        _device_info.get("memory_size", 0),
    )
    _memory_pressure_threshold = (
        int(_max_recommended * 0.85)
        if _max_recommended > 0
        else 200 * 1024 * 1024 * 1024
    _device_mem = mx.device_info().get("memory_size", 200 * 1024 * 1024 * 1024)
    _memory_pressure_threshold = int(
        _device_mem * min(_gpu_mem_util + 0.05, 0.99)
    )
```
1. Memory threshold uses physical RAM 🐞 Bug ☼ Reliability
EngineCore computes its emergency cache-clear threshold from mx.device_info()['memory_size'] instead of Metal’s max_recommended_working_set_size, so cache clearing can trigger far too late and lead to OOM/Metal instability. This diverges from other memory sizing in the codebase that consistently uses max_recommended_working_set_size for Metal limits.
Agent Prompt
### Issue description
`EngineCore`’s emergency memory pressure threshold is computed from `mx.device_info()['memory_size']` (physical memory), not `max_recommended_working_set_size`. On Metal, the recommended working set is the relevant ceiling; using physical memory can delay cache clearing until it’s too late, increasing risk of OOM / Metal command-buffer failures.
### Issue Context
Other parts of this repo already use `max_recommended_working_set_size` for memory sizing and limits (e.g., BatchedEngine’s `mx.set_memory_limit` and MLLM wired limit), so `EngineCore` should align with that source of truth.
### Fix Focus Areas
- vllm_mlx/engine_core.py[154-162]
- vllm_mlx/engine/batched.py[358-378]
### Implementation notes
- Prefer `max_recommended_working_set_size` when present; fall back to `memory_size` only if it’s missing/zero.
- Keep the `gpu_memory_utilization` scaling behavior, but scale the recommended working set, not physical memory.
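A minimal sketch of what these notes describe (illustrative; the device-info keys follow the quoted hunk, everything else is an assumption):

```python
import mlx.core as mx

def memory_pressure_threshold(gpu_memory_utilization: float) -> int:
    """Emergency cache-clear threshold in bytes, scaled from the Metal working set."""
    info = mx.device_info()
    # Prefer the recommended working set; fall back to physical memory only if it is missing/zero.
    budget = info.get("max_recommended_working_set_size", 0) or info.get("memory_size", 0)
    if budget <= 0:
        return 200 * 1024 * 1024 * 1024          # conservative fallback from the original code
    # Keep the gpu_memory_utilization scaling, applied to the recommended working set.
    return int(budget * min(gpu_memory_utilization + 0.05, 0.99))
```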
Summary
Resolves the 7 merge conflicts between
`fix/chat-template-kwargs-forwarding` and `main` on `waybarrios/vllm-mlx`:
- `enable_thinking` support (from main) alongside `chat_template_kwargs` forwarding (from this branch) in `batched.py`
- `_run_blocking_serialized` refactor in `simple.py`, while preserving `chat_template_kwargs` forwarding
- `chat_template_kwargs` through the new tool-stall workaround path in `simple.py`
See full details: waybarrios#218 (comment)